Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

Authors

  • Mathias Creutz
  • Krista Lagus
Abstract

This work presents an algorithm for the unsupervised learning, or induction, of a simple morphology of a natural language. The algorithm uses a probabilistic maximum a posteriori model that builds hierarchical representations for a set of morphs, the morpheme-like units discovered from unannotated text corpora. The induced morph lexicon stores parameters related to both the “meaning” and “form” of the morphs it contains. These parameters affect the role of the morphs in words. The model is applied to the unsupervised morpheme segmentation of Finnish and English words. Very good results are obtained for Finnish, and almost as good results for English.
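
As a rough illustration of what a maximum a posteriori criterion for morph segmentation can look like, the Python sketch below scores a candidate segmentation by adding a lexicon cost (the prior) to a corpus cost (the likelihood), both in bits. It is a simplified toy and not the authors' model: it ignores the hierarchical representations and the "meaning" and "form" parameters mentioned in the abstract. The function name `map_cost`, the uniform letter prior, and the `alphabet_size` parameter are assumptions made for this example.

```python
import math
from collections import Counter


def map_cost(segmentations, alphabet_size=26):
    """Return a toy negative log posterior (in bits) for a segmentation.

    `segmentations` maps each word to the list of morphs it is split into.
    Hypothetical simplification: uniform letter prior, unigram morph model.
    """
    morph_counts = Counter(
        morph for morphs in segmentations.values() for morph in morphs
    )
    total_tokens = sum(morph_counts.values())

    # Prior: every distinct morph is spelled out letter by letter,
    # plus one end-of-morph symbol, under a uniform code.
    bits_per_symbol = math.log2(alphabet_size + 1)
    lexicon_cost = sum((len(morph) + 1) * bits_per_symbol for morph in morph_counts)

    # Likelihood: each morph token costs -log2 of its relative frequency.
    corpus_cost = -sum(
        count * math.log2(count / total_tokens) for count in morph_counts.values()
    )
    return lexicon_cost + corpus_cost


# Two candidate analyses of the same four English words.
unsplit = {w: [w] for w in ["talked", "walked", "talks", "walks"]}
split = {
    "talked": ["talk", "ed"],
    "walked": ["walk", "ed"],
    "talks": ["talk", "s"],
    "walks": ["walk", "s"],
}

print(f"unsplit cost: {map_cost(unsplit):.1f} bits")
print(f"split cost:   {map_cost(split):.1f} bits")
```

On this tiny example the split analysis receives the lower cost, because reusing the morphs `talk`, `walk`, `ed`, and `s` shrinks the lexicon by more than the extra morph tokens add to the corpus cost. A real system would also need a search procedure over candidate segmentations, which this sketch leaves out.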

Similar resources

Acquisition of Large Scale Categorial Grammar Lexicons

A system is presented for inducing Categorial Grammar (CG) lexicons for natural language from either unannotated or minimally annotated corpora extracted from the Penn Treebank. A combination of symbolic and stochastic methods has been used to build a computationally effective and psychologically plausible system, which learns linguistically useful lexicons. There are a variety of parameters in...

Poor Man’s Word-Segmentation: Unsupervised Morphological Analysis for Indonesian

We present a partially new, fully unsupervised algorithm for morphological segmentation of an arbitrary natural language with only one-slot concatenative morphology. The behaviour of the algorithm is examined in detail for Indonesian, as it is a good approximation of such a language. The underlying theory makes no assumptions on whether the language is prefixing or suffixing, or whether affixes ar...

An Empirical Approach to Conceptual Case Frame Acquisition

Conceptual natural language processing systems usually rely on case frame instantiation to recognize events and role objects in text. But generating a good set of case frames for a domain is time-consuming, tedious, and prone to errors of omission. We have developed a corpus-based algorithm for acquiring conceptual case frames empirically from unannotated text. Our algorithm builds on previous r...

Expanding lexicons by inducing paradigms and validating attested forms

One of the bottlenecks in Natural Language Processing for a given language is creating a lexicon that covers the language. The morphological lexicon provides two important pieces of information for NLP applications: 1) the normalization of a word, its lemmatization, which allows the application to recognize two variants of the same word; and 2) the part-of-speech roles that the word can play, w...

Automatically Extending the Lexicon for Parsing

This paper describes a method for automatically extending the lexicon of wide-coverage parsers. The method is an extension to the automatic detection of coverage problems of natural language parsers, based on large amounts of raw text (van Noord 2004). The goal is to extend grammar coverage, focusing in particular on the acquisition of lexical information for missing and incomplete lexicon entr...

Publication date: 2005